Imagine that during the last 12 years you had put aside $200 every month in an investment account. How much money would that be now?
Let's answer this question by building our own portfolio and backtest tool using Python. We will start by loading some historical data using the Yahoo Finance API, explore and prepare a dataset, and then build a backtesting function to assess if investing those $200 every month using Dollar Cost Averaging (DCA) would have been a good idea.
The code used in this post is available on my GitHub, feel free to clone the project to follow along with the notebook. I tried to put everything in clear and concise functions that you can easily reuse in your program.
In the repository, you will also find a *README.md* explaining how to set up an environment with all the necessary dependencies.
# load lab_black for easy code formating
%load_ext lab_black
Let's first pick some financial assets we would like to invest in. I will use ETFs instead of specific stocks like Apple or Microsoft to avoid falling into the classical survivor bias trap and getting overly optimistic results.
An ETF is a collection of hundreds or thousands of stocks, bonds, or other assets, passively managed and available for trade on major stock exchanges. ETFs are perfect for long-term investing as they have low costs and are well diversified.
I will keep it nice and simple and build a portfolio that is using the common split of 80% stocks, and 20% bonds. The ticker SPY will be used for US stocks, and the ticker SPAB will be used for US bonds.
I am creating a dictionary with the two ETFs as keys (SPY and SPAB), and their associated portfolio weight as values (80% and 20% respectively).
# define the tickers and respective weights in portfolio
tickers = {
"SPY": {"weight": 0.8}, # 80% of portfolio
"SPAB": {"weight": 0.2}, # 20% of portfolio
}
# define start and end date of backtest period
start_date = "2010-01-01"
end_date = "2021-12-31"
We will be using the free Yahoo Finance API to pull historical prices and load the data into a Pandas dataframe, with each column containing the historical prices of the relevant ETF.
import yfinance as yf
import pandas as pd
def load_data(tickers: dict, start_date: str, end_date: str):
"""
This function pulls data from the Yahoo finance API into a pandas dataframe
"""
data = []
for ticker in tickers:
# pull open prices for ticker
df_ticker = yf.Ticker(ticker).history(start=start_date, end=end_date)[["Open"]]
# rename column with ticker name
df_ticker = df_ticker.rename(columns={"Open": f"{ticker}_price"})
# check if ticker data go back at least from the requested start_date
first_date = df_ticker.index.min()
if pd.to_datetime(start_date) < first_date:
raise ValueError(f"{ticker} data only available from {first_date}")
# append ticker data to all data
data.append(df_ticker)
# create dataframe with all tickers
df = pd.concat(data, axis=1)
# round prices to two decimals
df = df.round(2)
return df
# load the data
df_prices = load_data(tickers, start_date, end_date)
df_prices
| SPY_price | SPAB_price | |
|---|---|---|
| Date | ||
| 2009-12-31 | 88.42 | 19.27 |
| 2010-01-04 | 88.11 | 19.33 |
| 2010-01-05 | 88.81 | 19.38 |
| 2010-01-06 | 89.01 | 19.41 |
| 2010-01-07 | 88.99 | 19.40 |
| ... | ... | ... |
| 2021-12-23 | 463.38 | 29.20 |
| 2021-12-27 | 466.66 | 29.19 |
| 2021-12-28 | 472.25 | 29.22 |
| 2021-12-29 | 471.52 | 29.07 |
| 2021-12-30 | 472.46 | 29.10 |
3021 rows × 2 columns
Let's plot the historical prices of those two ETFs. I am using the Plotly backend to make quick interactive plots directly from the Pandas dataframe.
import pandas as pd
pd.options.plotting.backend = "plotly"
for ticker in tickers:
fig = df_prices.plot(y=f"{ticker}_price")
fig.show()
From those plots, we can see that in the last 12 years both the stock and bond US markets have grown quite significantly.
Given that the initial starting price of those two ETFs is different, it is hard to compare them over time. One solution for that is to normalize the prices by the first price occurrence, as done below.
# normalize prices by dividing by first period
df_prices_norm = df_prices / df_prices.iloc[0]
fig = df_prices_norm.plot()
fig.show()
In this plot, we can see the big difference in return between stocks and bonds.
We can also compute some basic statistics on the daily return. We can use the function pct_change() and describe() from Pandas to do that.
df_prices_norm.pct_change().describe()
| SPY_price | SPAB_price | |
|---|---|---|
| count | 3020.000000 | 3020.000000 |
| mean | 0.000606 | 0.000141 |
| std | 0.010105 | 0.003123 |
| min | -0.088014 | -0.045537 |
| 25% | -0.003543 | -0.001218 |
| 50% | 0.001141 | 0.000000 |
| 75% | 0.005554 | 0.001559 |
| max | 0.063491 | 0.053179 |
The mean return of SPY is much higher than the one of SPAB, as is the standard deviation (std).
Now that we have historical data, we will continue with some preprocessing before doing the backtesting.
To prepare our data for the simulation, we will resample them at day level so that we have data points for each ticker every day of the year, including weekends. Stocks and ETFs are not traded during weekends but imagine adding some cryptocurrencies to your portfolio, it then becomes important to also have prices on weekends. Doing this will also make the backtesting easier.
We can use the resample() function from Pandas together with the fillna() method using forward fill. The latest price available (let's say on Friday) is then forwarded to the days without prices (Saturday and Sunday).
def resample_data(df_prices: pd.DataFrame):
"""
This functions resamples the data to get prices for all days
"""
return df_prices.resample("1D").last().fillna(method="ffill")
# resample prices
df_prices = resample_data(df_prices)
df_prices.head(10)
| SPY_price | SPAB_price | |
|---|---|---|
| Date | ||
| 2009-12-31 | 88.42 | 19.27 |
| 2010-01-01 | 88.42 | 19.27 |
| 2010-01-02 | 88.42 | 19.27 |
| 2010-01-03 | 88.42 | 19.27 |
| 2010-01-04 | 88.11 | 19.33 |
| 2010-01-05 | 88.81 | 19.38 |
| 2010-01-06 | 89.01 | 19.41 |
| 2010-01-07 | 88.99 | 19.40 |
| 2010-01-08 | 89.30 | 19.42 |
| 2010-01-09 | 89.30 | 19.42 |
Now that we have a clean time series for all our tickers, we can continue with the backtesting.
The investment strategy we will simulate is called Dollar Cost Averaging (DCA). DCA is the practice of investing the same amount of money on a regular basis, irrespective of asset prices or market conditions. With DCA, we do not try to time the market, we simply keep investing the same amount for many years and build wealth over time. Like any other investment approach, DCA has many advantages but also some limitations, for more info, you can refer to this Investopedia page.
The goal is to see how our portfolio, composed of 80% stocks and 20% bonds, would have performed over the period 2010 - 2021.
This is the most complicated part of this post, I tried to put as many comments as I could to explain the different steps.
When looping through all the days of our backtest period, we will buy more shares of each ticker every 1st of the month using the monthly addition ($200 in our case). As we do that, we will update the total amount invested as well as the portfolio value accordingly.
Programmatically, I created a dictionary called record that will be updated as we go through each day. All records are then concatenated to create the final dataframe containing our portfolio.
It's good to note here that we take the assumption that we can buy new shares every 1st of the month. As we resampled the data at day level this is not a problem for our backtest, however, in practice, we might not have been able to do that given that the stock market is closed on weekends. We would have to wait for the next opening day to buy. This is something to keep in mind.
from copy import copy
def get_number_of_shares(amount: float, price: float, weight: float):
"""
This function calculates how much shares we can buy given a certain amount, price and weight
"""
return amount / price * weight
def get_portfolio(
tickers: dict,
df_prices: pd.DataFrame,
initial_amount: float,
monthly_addition: float,
):
"""
This function calculates the daily value of the portfolio:
- loop through each day
- if first of the month, then buy more shares using the monthly_addition
- update amount_invested and portfolio_value accordingly
"""
# check that tickers weights sum up to 1
if sum(d["weight"] for d in tickers.values()) != 1:
raise ValueError("tickers weights should sum up to 1")
# create empty list of records
records = []
# initialize first record with
# - first date
# - first prices available for each tickers
# - first number of shares for each tickers (using initial_amount)
# - first amount invested (initial_amount)
# - first portfolio value (initial_amount)
record = {
"date": df_prices.index[0],
**{
f"{ticker}_price": df_prices.iloc[0][f"{ticker}_price"]
for ticker in tickers
},
# shares
**{
f"{ticker}_shares": get_number_of_shares(
amount=initial_amount,
price=df_prices.iloc[0][f"{ticker}_price"],
weight=tickers[ticker]["weight"],
)
for ticker in tickers
},
"amount_invested": initial_amount,
"portfolio_value": initial_amount,
}
# loop through all days
for day in df_prices.index:
# update date
record["date"] = day
# update prices
for ticker in tickers:
record[f"{ticker}_price"] = df_prices.loc[day][f"{ticker}_price"]
# if first day of the month, update number of shares and amount invested
if day.day == 1:
# buy more shares using the monthly addition
for ticker in tickers:
record[f"{ticker}_shares"] += get_number_of_shares(
amount=monthly_addition,
price=record[f"{ticker}_price"],
weight=tickers[ticker]["weight"],
)
# the new amount invested is equal to the previous amount invested plus the monthly addition
record["amount_invested"] += monthly_addition
# update portfolio value, which is equal the number of shares * share price, summed up for every ticker
record["portfolio_value"] = 0
for ticker in tickers:
record["portfolio_value"] += (
record[f"{ticker}_shares"] * record[f"{ticker}_price"]
)
# append the record with other records
records.append(copy(record))
# create portfolio using all the records
df_porfolio = pd.DataFrame(records).set_index("date")
return df_porfolio
Let's run this function with an initial amount of $1000 and a monthly addition of $200.
initial_amount = 1000
monthly_addition = 200
df_portfolio = get_portfolio(tickers, df_prices, initial_amount, monthly_addition)
df_portfolio
| SPY_price | SPAB_price | SPY_shares | SPAB_shares | amount_invested | portfolio_value | |
|---|---|---|---|---|---|---|
| date | ||||||
| 2009-12-31 | 88.42 | 19.27 | 9.047727 | 10.378827 | 1000 | 1000.000000 |
| 2010-01-01 | 88.42 | 19.27 | 10.857272 | 12.454593 | 1200 | 1200.000000 |
| 2010-01-02 | 88.42 | 19.27 | 10.857272 | 12.454593 | 1200 | 1200.000000 |
| 2010-01-03 | 88.42 | 19.27 | 10.857272 | 12.454593 | 1200 | 1200.000000 |
| 2010-01-04 | 88.11 | 19.33 | 10.857272 | 12.454593 | 1200 | 1197.381521 |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-12-26 | 463.38 | 29.20 | 147.511798 | 248.575322 | 29800 | 75612.416500 |
| 2021-12-27 | 466.66 | 29.19 | 147.511798 | 248.575322 | 29800 | 76093.769445 |
| 2021-12-28 | 472.25 | 29.22 | 147.511798 | 248.575322 | 29800 | 76925.817657 |
| 2021-12-29 | 471.52 | 29.07 | 147.511798 | 248.575322 | 29800 | 76780.847746 |
| 2021-12-30 | 472.46 | 29.10 | 147.511798 | 248.575322 | 29800 | 76926.966096 |
4383 rows × 6 columns
We now have a nice dataframe with the daily value of our portfolio, let's continue with some plots to see how our strategy would have performed.
Let's print and plot the total amount invested as well as the final portfolio value over time to see if investing in those two ETFs would have been a good idea.
def print_results(df_portfolio: pd.DataFrame):
"""
Print total amount invested and portfolio value
"""
print(
"Total Amount Invested: ${:0,.2f}".format(
df_portfolio["amount_invested"].iloc[-1],
)
)
print(
"Final portfolio value: ${:0,.2f}".format(
df_portfolio["portfolio_value"].iloc[-1],
)
)
print_results(df_portfolio)
fig = df_portfolio.plot(y=["amount_invested", "portfolio_value"])
fig.show()
Total Amount Invested: $29,800.00 Final portfolio value: $76,926.97
Nice! 🐷
For this specific period, it does seem that investing our money in the stock market would have been a very good idea. With the $29,800 invested in total, after 12 years we would have ended up with almost $77,000. That's more than doubling our money! This is definitely better than having those savings sleeping in the bank.
The compound return effect is especially important when taking longer time horizons. If you invest like this for 40+ years, you can easily end up with more than $1,000,000 in your account!
To make it easy to re-run those calculations with different assets, I created one function that runs all the above steps in one go.
Feel free to play around with different tickers, initial amount, or monthly addition!
def simulate_portfolio(
tickers: dict,
start_date: str,
end_date: str,
initial_amount: float,
monthly_addition: float,
):
df_prices = load_data(tickers, start_date, end_date)
df_prices = resample_data(df_prices)
df_portfolio = get_portfolio(tickers, df_prices, initial_amount, monthly_addition)
return df_portfolio
For example, let's build two portfolios, an aggressive one with 90% stocks and 10% bonds, and a more conservative one with a 50/50 ratio between stocks and bonds.
# portfolio 1: aggresive
tickers = {
"SPY": {"weight": 0.9},
"SPAB": {"weight": 0.1},
}
df_portfolio_aggressive = simulate_portfolio(
tickers, start_date, end_date, initial_amount, monthly_addition
)
# portfolio 2: conservative
tickers = {
"SPY": {"weight": 0.5},
"SPAB": {"weight": 0.5},
}
df_portfolio_conservative = simulate_portfolio(
tickers, start_date, end_date, initial_amount, monthly_addition
)
Let's plot the results.
import plotly.graph_objects as go
print("Aggressive")
print_results(df_portfolio_aggressive)
print()
print("Conservative")
print_results(df_portfolio_conservative)
fig = go.Figure()
fig.add_trace(
go.Line(
x=df_portfolio_aggressive.index,
y=df_portfolio_aggressive["portfolio_value"],
name="Aggresive",
)
)
fig.add_trace(
go.Line(
x=df_portfolio_conservative.index,
y=df_portfolio_conservative["portfolio_value"],
name="Conservative",
)
)
fig.show()
Aggressive Total Amount Invested: $29,800.00 Final portfolio value: $82,021.89 Conservative Total Amount Invested: $29,800.00 Final portfolio value: $61,642.22
The more aggressive portfolio offers an extra return of about 30% (20K), which is considerable. One thing to note here is that those calculations do not include dividend payments, which would make those numbers higher (especially for the portfolio with more bonds).
Which strategy to pick? This all depends on your risk aversion level, personal objective, and time horizon. Younger people who are OK to invest their money for 20+ years might favor the riskier approach, while people closer to retirement age might want to have a more secure portfolio given that they will need to unlock the money sooner.
You could also use portfolio metrics such as Sharpe ratio, Sortino ratio, etc. to determine which portfolio is better in terms of risk/reward.
Let's say you were a crypto fan, and instead of buying bonds, you invested your money into stocks and Bitcoin in equal shares.
# set start date to 2015-01-01 as BTC-USD prices are only available from end 2014
start_date = "2015-01-01"
end_date = "2021-12-31"
tickers = {
"SPY": {"weight": 0.5}, # 50% stocks
"BTC-USD": {"weight": 0.5}, # 50% Bitcoin
}
df_portfolio_crypto = simulate_portfolio(
tickers, start_date, end_date, initial_amount, monthly_addition
)
print_results(df_portfolio_crypto)
fig = df_portfolio_crypto.plot(y=["amount_invested", "portfolio_value"])
fig.show()
Total Amount Invested: $17,800.00 Final portfolio value: $467,839.39
In 7 years, your $17,800 invested would have become $467,839.
No comment 🚀
Of course, it's easy to look at past data and pick assets that skyrocketed. In practice though, building a strong portfolio with consistent return is much more complicated.
This is it, in this post we built a simple portfolio and DCA backtesting framework using Python. Feel free to play around with different assets, and time horizons or extend the framework with additional functionalities like transaction fees, dividend distribution, rebalancing, inflation-adjusted, or re-investing at different frequencies than monthly.
Check out my other articles about investing and Machine Learning and if you have any questions or remarks, drop a comment, BuyMeACoffee or follow me for more posts like this one!
You can also join me on the trading platform eToro, where I use similar techniques including Machine Learning at the core of my investment strategy.
This post contains an affiliate link to the social trading platform eToro